Lightweight Multi-Scale Asymmetric Attention Network for Image Super-Resolution

Recently, with the development of convolutional neural networks, single-image super-resolution (SISR) has achieved better performance. However, the practical application of image super-resolution is limited by the large number of parameters and computations. In this work, we present a lightweight multi-scale asymmetric attention network (MAAN), which consists of a coarse-grained feature block (CFB), fine-grained feature blocks (FFBs), and a reconstruction block (RB). MAAN adopts multiple paths to facilitate information flow and achieve a better balance between performance and parameters. Specifically, the FFB applies a multi-scale attention residual block (MARB) to capture richer features by exploiting pixel-to-pixel correlations. The asymmetric multi-weights attention blocks (AMABs) in MARB are designed to obtain the attention maps for improving SISR efficiency and readiness. Extensive experimental results show that our method achieves comparable performance with fewer parameters than current advanced lightweight SISR methods.


Introduction
Image super-resolution (SR) is the process of recovering a high-resolution (HR) image from a given low-resolution (LR) image. Because several HR images can correspond to the same LR image, the problem is fundamentally ill-posed. Recently, many researchers have introduced deep learning (DL) to solve the SR problem. In particular, single-image SR has achieved remarkable performance using deep convolutional neural network (CNN) techniques [1]. Dong et al. [2] built an end-to-end SR convolutional neural network (SRCNN), which obtained a significant performance improvement over traditional methods. Kim et al. [3] presented a very deep super-resolution (VDSR) network, which increased the depth of the network to 20 layers and reduced training difficulty through residual learning. Lim et al. [4] designed an enhanced deep super-resolution (EDSR) network with a deep architecture of more than 60 layers, achieving high reconstruction accuracy. To reduce the network depth and extract diverse features, some researchers studied multi-path networks to obtain various features at multiple contextual scales. Liu et al. [5] proposed a residual feature distillation network (RFDN), which learned more discriminative feature representations through multiple feature distillation connections. The SR networks discussed above treat all channels and locations as equally important. However, attention-based networks confirm that not all features are essential for SR. Inspired by SENet [6], Zhang et al. [7] employed a residual channel attention network (RCAN) to enhance SR results by exploiting the interdependence among channels with channel attention residual blocks. In addition, the spatial attention mechanism exploits the spatial information of the feature maps for HR image reconstruction. Liu et al. [8] presented a residual feature aggregation network (RFANet).
Overall, our goal is to propose a lightweight model that optimizes the reconstructed image and achieves the desired trade-off between parameters and computation. The contributions of our work are as follows:
1. We employ fine-grained feature blocks (FFBs) as the backbone module of our framework, which achieves reasonable SR performance with fewer parameters. The multi-scale attention residual block (MARB) in each FFB extracts sufficient multi-scale features for global feature fusion. It enhances asymmetric attention neurons in a larger receptive field to capture richer multi-frequency information.
2. We propose an asymmetric multi-weights attention block (AMAB) to enhance feature propagation and further extract high-frequency detail features by adaptive selection among the layers.
3. MAAN achieves a better trade-off between performance and model size than popular models.
The rest of this paper is structured as follows: Section 2 presents related work on lightweight networks and attention mechanisms in image super-resolution. Section 3 shows the MAAN approach in detail. Section 4 illustrates the experiments and provides important arguments for the proposed technique and shows the experimental performance of SISR. Section 5 concludes the paper.

Lightweight Super-Resolution Networks
To extend SR models to mobile devices, researchers have turned to lightweight models that decrease the number of parameters and the computational cost. The deeply recursive convolutional network (DRCN) [13] used recursion to apply a single convolutional layer repeatedly without adding many parameters. The Laplacian pyramid SR network (LapSRN) [14] reconstructed high-resolution images by learning residuals in convolutional layers with step-by-step upscaling. To better balance performance and inference cost, the information distillation network (IDN) [15] effectively combined the characteristics of a global long path and a local short path, which achieved lightweight and efficient reconstruction. Multiple information distillation blocks were introduced into the IMDN [10] to increase the receptive field, and the hierarchical information was fused through channel attention. The lightweight enhanced SR CNN (LESRCNN) [16] adopted a heterogeneous structure, improving SR performance by combining low-frequency with high-frequency features. The asymmetric CNN (ACNet) [17] used asymmetric convolution to construct hierarchical structure features for adaptively combining local and global information. The multi-scale attention network (MSAN) [18] cascaded multiple multi-scale attention blocks and split channel features to further improve performance. Even though the number of lightweight SR methods has grown significantly, it remains hard to balance reconstruction accuracy and model capacity.
In some methods, multi-scale feature extraction via dilated convolution captures redundant contextual information while bringing in non-essential parameters and computational costs. In others, excessive scaling of model parameters makes the image too smooth, so that the perceptual difference between the model output and the ground-truth image is poorly captured. Hence, we aim to build a lightweight network that uses multiple paths to facilitate information flow and accomplish better information exchange. Accordingly, our study introduces a novel multi-scale block with simple 3 × 3 convolutional combinations to aggregate information of different scales and levels. In addition, channel scaling with asymmetric convolution further reduces parameters and computational costs.

Attention Mechanism
The attention mechanism assigns higher priority to specific pixels, so that they are processed more carefully than the others. Recently, the attention mechanism has been widely used in SR to obtain significant features by suppressing insignificant ones. The channel attention mechanism focuses only on per-channel features, computing one-dimensional weights that are multiplied with the pixels of each channel. Niu et al. [19] presented the holistic attention network (HAN), which selectively captures more informative features across layers, channels, and positions. The dense residual Laplacian network (DRLN) [20] proposed a Laplacian pyramidal attention mechanism for learning multiple frequency features. The sparse mask SR (SMSR) [21] explored spatial masks to improve the inference efficiency of SR networks; SMSR learned to identify "significant" regions in contrast to channel masks. We observe that existing attention modules focus on channel attention or spatial attention, which limits the network to learning 1D or 2D attention weights. SimAM [22] proposed 3D attention weights to refine the feature map in a layer without adding parameters to the original networks. The SimAM module performed well on image classification and object detection.
The attention mechanism still has considerable room for improvement in balancing accuracy and model capacity. Inspired by SimAM, our study introduces a new attention module, AMAB, which identifies significant information by exploring inter-channel and intra-channel relationships, facilitates the extraction of diverse features, and further improves performance with a small number of parameters and computations.

Network Architecture
In this section, we describe our lightweight and efficient MAAN. MAAN consists of three main components: a coarse-grained feature block (CFB), fine-grained feature blocks (FFBs), and a reconstruction block (RB), as depicted in Figure 2. We denote the LR image, the HR image, and the SR image as $I_{LR}$, $I_{HR}$, and $I_{SR}$, respectively.
Firstly, the input is processed by the CFB. We extract coarse-grained features via only one 3 × 3 convolution layer for a lightweight design. The CFB can be formulated as follows:
$$x_0 = f_{CFB}(I_{LR}),$$
where $f_{CFB}(\cdot)$ denotes the operation of the CFB and $x_0$ denotes the coarse-grained features, which are used as the input to the fine-grained feature blocks (FFBs) for deep feature extraction.
Secondly, the FFB is the core step for extracting high-frequency features. To fully utilize the image features from the CFB, we use multiple paths to further refine the features and gather various features. The process can be expressed as follows:
$$x_i = f_{FFB}(x_{i-1}),$$
where $f_{FFB}(\cdot)$ denotes the operation of the FFB, and $x_{i-1}$ and $x_i$ represent the input and the output of the i-th FFB, respectively.
Finally, in the last stage of the model, we reduce artifacts by using an upsampling operation with sub-pixel convolution, and the enlarged features are mapped to the SR image through a 3 × 3 convolution layer. As shown in Figure 2, $x_0$ and $x_i$ are transmitted to the reconstruction block, $f_{RB}$, via a global residual connection.
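The sub-pixel convolution used for upsampling ends with a channel-to-space rearrangement. The sketch below illustrates only that rearrangement in plain Python on nested lists; the function name `pixel_shuffle` and the channel ordering (the one commonly used by deep learning frameworks) are assumptions for illustration, not details taken from the paper.

```python
def pixel_shuffle(features, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Assumed channel ordering: input channel c*r*r + i*r + j supplies the
    pixel at offset (i, j) inside each r x r output cell of channel c.
    """
    channels_in = len(features)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(features[0]), len(features[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h * r):
            for x in range(w * r):
                # The position inside the r x r cell selects the source channel.
                src = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = features[src][y // r][x // r]
    return out


# Four 1x1 channels become one 2x2 channel for scale factor r = 2.
lr_features = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
sr_features = pixel_shuffle(lr_features, 2)
```

The rearrangement itself has no learnable parameters; the cost of this upsampling step comes from the convolution that produces the C·r² channels.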
Hence, MAAN improves the quality of the final reconstruction at a small cost in parameters. It aggregates features from multiple receptive fields to collect rich contextual information for the LR-to-HR mapping, which enables a more detailed image to be reconstructed. The super-resolved image, $I_{SR}$, is the output of the reconstruction block $f_{RB}$ applied to $x_0$ and $x_i$. We adopt the L1 loss [23] to train MAAN: it minimizes the difference between the predicted SR image and the given HR image over the training set, where θ represents the learnable parameters and L represents the loss function.
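For reference, the usual L1 objective matching this description can be written as below. The training-set notation $\{I_{LR}^{(n)}, I_{HR}^{(n)}\}_{n=1}^{N}$ and the symbol $f_{MAAN}$ for the whole network are assumptions introduced here for illustration.

```latex
L(\theta) = \frac{1}{N} \sum_{n=1}^{N}
  \left\| f_{MAAN}\!\left(I_{LR}^{(n)}; \theta\right) - I_{HR}^{(n)} \right\|_{1}
```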

Fine-Grained Feature Block
As depicted in Figure 3, our FFB is essentially a multi-path module, which refines the features in terms of spatial context and produces better information exchange through multiple paths of information flow. The FFB is constructed from MARB, AMAB, and 1 × 1 convolutions. It uses a channel split operation with multiple paths, which divides the input features into two parts: the upper part is retained for the MARB operation, denoted $f_{MARB}(\cdot)$, and the lower part is compressed by a 1 × 1 convolution to extract features. The features of the branches are concatenated as $[F_1, F_2, F_3, F_4]$ and fused by a convolution with 1 × 1 kernel size, where $C_{k \times k}$ denotes the convolution operation with k × k kernel size. Then, AMAB, denoted $f_{AMAB}(\cdot)$, is applied to enhance the feature flow, allowing higher weights to be assigned to more important features and refining high-frequency details.
Figure 3. The structure of the proposed FFB.

Multi-Scale Attention Residual Block
When feature extraction is carried out with a convolution kernel of fixed scale, the reconstruction ability of the network is limited by local feature information. Multi-scale attention residual blocks can enlarge the receptive field and improve performance. Chen et al. [24] addressed multi-scale feature extraction with dilated convolution and proposed an encoder-decoder image segmentation method called DeepLabV3+. However, this method directly concatenated features at different scales, which made it difficult to merge this information. To solve this issue, we implemented a new module, MARB, which enlarges the receptive field. MARB employs an attention mechanism to improve the extraction of high-frequency detail features and adopts residual learning to reduce vanishing gradients and facilitate information flow.
As depicted in Figure 4, MARB applies multiple paths to combine the multi-scale features, with one 3 × 3 convolution layer at the top and two 3 × 3 convolution layers at the bottom to expand the receptive field and achieve better feature correlation. The AMAB operation ensures maximum capture of feature information at different scales to achieve better feature relevance. Residual learning in each MARB helps ease the training difficulty of convolutional networks and improves the information expression effectively. As mentioned above, this allows MARB to take advantage of available resources to obtain richer information in the SR image.
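To make the multi-scale claim concrete, the receptive fields of the two MARB paths can be compared with the standard formula for a chain of convolutions. This is a small sketch assuming stride 1 and no dilation, which the text does not state explicitly; the helper name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field (in input pixels) of a chain of undilated convolutions."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= s             # stride multiplies the step between output pixels
    return rf


top_branch = receptive_field([3])        # one 3 x 3 convolution
bottom_branch = receptive_field([3, 3])  # two stacked 3 x 3 convolutions
```

Under these assumptions the top path sees a 3 × 3 neighborhood and the bottom path a 5 × 5 neighborhood, so concatenating them mixes two scales while using only 3 × 3 kernels.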

Asymmetric Multi-Weights Attention Block
Pixels in an image do not exist independently; they are correlated with each other. Previous methods designed channel attention or spatial attention for refining feature maps, thereby ignoring the relations between pixels. Pixels are treated equally either across all channels or across all locations, so accurate 3D weights cannot be computed efficiently. Yang et al. [22] proposed 3D attention weights to compensate for the limitations of a 1D attention vector or a 2D attention map in extracting features. Linear separability can be used to measure the relation between a target neuron and other neurons. Borst et al. [25] determined that, for Drosophila's visual orientation selectivity, lobule plate neurons determine the spatial receptive fields of neurons through direction-selective inputs from perceptual neurons T4 and T5 in the fly's visual system, significantly enhancing preferred directional features and zero-directional features, and performing directional information integration for efficient information flow. Inspired by these findings, we design an asymmetric multi-weights attention block (AMAB) that can capture long-range dependencies directly from feature maps.
Firstly, asymmetric convolutions reinforce the salient features along the horizontal and vertical directions by factorizing a k × k convolution into a k × 1 and a 1 × k kernel [26]. To avoid introducing computational overhead and extra parameters, the upper branch contains 3 × 1 and 1 × 3 asymmetric convolution kernels. The 3 × 1 convolution compresses the number of channels with a reduction ratio R, and the subsequent 1 × 3 convolution expands them back to the original number of channels. We set R = 2, which removes nearly half of the operations and parameters while retaining the same receptive field and balances the number of channels against input/output connectivity.
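The effect of the reduction ratio on model size can be checked with simple weight counting. The sketch below assumes bias-free convolutions and uses C = 40 channels, the width reported in the experimental settings; the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k_h, k_w):
    """Weight count of one bias-free convolution layer."""
    return c_in * c_out * k_h * k_w


def asymmetric_pair_params(channels, reduction):
    """3 x 1 conv compressing C -> C/R, then 1 x 3 conv expanding C/R -> C."""
    mid = channels // reduction
    return conv_params(channels, mid, 3, 1) + conv_params(mid, channels, 1, 3)


C = 40
full_3x3 = conv_params(C, C, 3, 3)       # plain 3 x 3 convolution
pair_r1 = asymmetric_pair_params(C, 1)   # factorized, no channel compression
pair_r2 = asymmetric_pair_params(C, 2)   # factorized with R = 2
pair_r4 = asymmetric_pair_params(C, 4)   # factorized with R = 4
```

With these assumptions the factorized pair costs 6C²/R weights against 9C² for a plain 3 × 3 layer, so moving from R = 1 to R = 2 halves the cost of this pair, and R = 4 halves it again at the price of a narrower intermediate representation.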
As shown in Figure 5, AMAB has three steps. The first step fuses features from the horizontal and vertical directions via the asymmetric convolutions; its output, $F'_{up}$, is used as the input to the multi-weights attention.
The second step extracts more effective features using multi-weights attention. All computations in the AMAB are element-wise operations. Each pixel across the channel and spatial dimensions is weighted through the sigmoid function σ(·), which does not affect the relative importance of each pixel but only bounds the values during the calculation to avoid excessively large values. In multi-weights attention, each pixel is interconnected with other pixels, which allows the feature map to more realistically reflect the internal features of the image.
The weight generation is formulated as an energy function to reconstruct the attention mechanism while remaining lightweight. By adaptive selection among various layers, AMAB can capture features of different frequencies. The specific implementation of asymmetric multi-weights attention is shown in Algorithm 1.

Algorithm 1: The implementation of asymmetric multi-weights attention.
Input X: The feature matrix of H × W × C size.
Output X: The resultant matrix of H × W × C size.
(1) Set a 3 × 1 convolution layer and compress the channels to C/2.
(2) Use a 1 × 3 convolution layer and expand the channels to C.
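Since the attention weights are described as energy-based and inspired by SimAM [22], the following sketch shows the published SimAM-style weighting for a single H × W channel in plain Python. It is an illustration of that parameter-free weighting only, not the authors' AMAB; the function name `simam_channel` and the regularizer value `e_lambda = 1e-4` are assumptions.

```python
import math


def simam_channel(channel, e_lambda=1e-4):
    """Apply SimAM-style energy-based weights to one H x W channel (nested lists)."""
    values = [v for row in channel for v in row]
    if len(values) < 2:
        raise ValueError("channel needs at least two pixels")
    n = len(values) - 1
    mean = sum(values) / len(values)
    sq_dev = [[(v - mean) ** 2 for v in row] for row in channel]
    variance = sum(d for row in sq_dev for d in row) / n
    out = []
    for row, dev_row in zip(channel, sq_dev):
        out_row = []
        for v, d in zip(row, dev_row):
            # Inverse energy: pixels far from the channel mean score higher.
            inv_energy = d / (4.0 * (variance + e_lambda)) + 0.5
            weight = 1.0 / (1.0 + math.exp(-inv_energy))  # sigmoid bounds the weight
            out_row.append(v * weight)
        out.append(out_row)
    return out


# A flat patch with one distinctive pixel in the center.
channel = [[0.1, 0.1, 0.1], [0.1, 2.0, 0.1], [0.1, 0.1, 0.1]]
refined = simam_channel(channel)
```

In this toy input the distinctive center pixel receives a larger weight than the uniform background, which is the behavior the energy formulation is meant to produce, and no learnable parameters are involved.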

Datasets and Metrics
DIV2K [27] was the source of training and validation data for our model: the first 800 images were used for training and the rest for validation. DIV2K is also used for training by most models. We used four standard benchmark datasets as test sets: Set5 [28], Set14 [29], B100 [30], and Urban100 [31]. The original HR training images were downsampled by bicubic interpolation with scale factors ×2, ×3, and ×4 to obtain the corresponding LR images. The training images were augmented by random rotations of 90°, 180°, and 270° and by horizontal flipping. The PSNR is the traditional evaluation metric, while the structural similarity (SSIM) measures the perceived structural information within images. Because human vision is more sensitive to changes in luminance, PSNR and SSIM were computed on the luminance (Y) channel of the converted YCbCr space. During the training stage, LR images were split into 64 × 64 patches, and the mini-batch size was set to 16. Our network adopted the ADAM optimizer [32] with β1 = 0.9, β2 = 0.999, and ε = 1 × 10^−8 to minimize the loss function. The initial learning rate was lr = 1 × 10^−4 and was halved every 25,000 epochs. To ensure that our proposed MAAN had a low model capacity, we set the number of FFBs to i = 4 and the number of channels to C = 40. We implemented our network in PyTorch with an RTX 3080 GPU with 12 GB of memory on an R5-5600 machine.
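As an illustration of the evaluation protocol, PSNR on the Y channel can be sketched as follows. The conversion uses the ITU-R BT.601 luma coefficients for 8-bit RGB, which is the common choice in SR benchmarks but is an assumption here, as are the function names; SSIM is omitted for brevity.

```python
import math


def rgb_to_y(r, g, b):
    """BT.601 luma of an 8-bit RGB pixel, in the 16-235 range."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(sr_pixels, hr_pixels, peak=255.0):
    """PSNR in dB between two equally sized lists of (R, G, B) tuples, on Y only."""
    if not hr_pixels or len(sr_pixels) != len(hr_pixels):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum(
        (rgb_to_y(*s) - rgb_to_y(*h)) ** 2 for s, h in zip(sr_pixels, hr_pixels)
    ) / len(hr_pixels)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


# Toy example: three pixels, two of them slightly off.
hr = [(255, 255, 255), (0, 0, 0), (128, 64, 32)]
sr = [(250, 250, 250), (5, 5, 5), (128, 64, 32)]
score = psnr_y(sr, hr)
```

Because the metric is logarithmic in the mean squared error, the gaps of a few hundredths of a dB reported in the ablation studies correspond to very small changes in average pixel error.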

Number of FFBs
To better balance model capacity and reconstruction accuracy, we conducted experiments with different numbers of FFBs. Table 1 reports the results for scale factor ×3 on Urban100. SR performance improves as i grows, accompanied by increases in computational cost and the number of parameters. To ensure that the proposed model is lightweight enough, we set i = 4 in the final model.

Effect of Reduction Ratio R Setting in AMAB
To analyze the value of the reduction ratio R in the asymmetric convolution, we trained two extra models for comparison, with R = 1 and R = 4, respectively. As shown in Figure 6, compared to these two models, MAAN obtained the best results with the advantages of split channels: the PSNR increased from 34.12 dB to 34.32 dB, and the SSIM improved by 0.0021. At the same time, the number of parameters decreased by 28 K, and the computational cost, i.e., multi-adds, dropped by 7.89 G. Asymmetric convolution improved feature representation through channel changes. However, if the number of channels is compressed too far, some detailed features are lost. These changes also imply that effectively using the correlation of asymmetric multi-weights attention within the image can significantly assist in extracting accurate features from the image.

Effect of AMAB
To evaluate the AMAB, we provided two models for comparison. We first replaced the AMAB with plain channel attention (CA) to obtain MAAN-CA. Then, we removed the AMAB to obtain MAAN-NOAMAB. As shown in Table 2, the performance of MAAN-NOAMAB was lower than that of the original MAAN, with a 0.10 dB drop in PSNR. At the same time, the PSNR and SSIM values of MAAN-CA were 0.05 dB and 0.0007 lower than those of our model, respectively. Notably, our proposed AMAB adds only a few extra parameters and a small amount of memory while yielding higher reconstruction accuracy. These results support the effectiveness and rationality of the AMAB.

Quantitative Evaluation
The quantitative evaluation results, in terms of the average PSNR and SSIM over the four benchmark datasets, are shown in Table 3. For a more intuitive comparison, we also give the parameters and multi-adds. The number of parameters was counted from the convolutional kernels of the network. In addition, multi-adds was employed to evaluate the model's computational complexity; it indicates the number of composite multiply-accumulate operations for a single image. The multi-adds were computed for a 1280 × 720 output image. Overall, our model with nearly 668K parameters showed better reconstruction accuracy in terms of objective quality scores on most benchmark datasets. Most of the quantitative results of MAAN were either the best or the second-best among lightweight models. For scale factor ×2, the PSNR of MAAN was 0.01 dB lower than that of WMRN on Set5 and 0.01 dB lower than that of CARN on Set14. However, CARN has considerably more parameters and a higher computational overhead. For scale factor ×3, MAAN achieved the best SSIM of all methods and was superior to the other models in PSNR except for CARN. For scale factor ×4, MAAN outperformed most methods and achieved comparable results with few operations, requiring fewer multi-adds with a moderate number of parameters. These advantages indicate that MAAN has a good reconstruction capability and tends to produce results of high perceptual quality. Moreover, existing models with fewer parameters perform worse than our model. For example, although the multi-adds value of LESRCNN is much lower than that of our model, its results are unsatisfactory. Compared to the MWRN, our method achieved a performance improvement with slightly more parameters.
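The multi-adds figure can be reproduced layer by layer with a simple count. The sketch below assumes bias-free convolutions and that feature-extraction layers run on the LR grid (output size divided by the scale factor), as is usual when upsampling happens at the end of the network; the function name and the example layer are illustrative, not taken from Table 3.

```python
def conv_multi_adds(c_in, c_out, k_h, k_w, out_h, out_w):
    """Multiply-accumulate count of one bias-free convolution layer."""
    return c_in * c_out * k_h * k_w * out_h * out_w


OUT_H, OUT_W = 720, 1280  # output image size used for reporting multi-adds

# One 3 x 3 convolution with 40 -> 40 channels at full output resolution.
at_hr = conv_multi_adds(40, 40, 3, 3, OUT_H, OUT_W)

# The same layer on the LR grid of a x2 model (640 x 360 feature maps).
scale = 2
at_lr = conv_multi_adds(40, 40, 3, 3, OUT_H // scale, OUT_W // scale)

at_hr_giga = at_hr / 1e9
at_lr_giga = at_lr / 1e9
```

Under these assumptions the same layer costs about 13.3 G multi-adds at output resolution but about 3.3 G on the ×2 LR grid, which is why the reported multi-adds of a fixed architecture shrink as the scale factor grows.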
These results support the superiority of our proposed MAAN over advanced models in combining a lightweight design with accurate reconstruction.
Figure 7 shows a qualitative comparison on Set14 for scale factor ×2. Many methods cannot reconstruct the enlarged outline of the hair strands on the left side of the boy's head, whereas MAAN recovers the hair details well, reflecting the role of AMAB in recovering high-frequency details. Figure 8 shows a qualitative comparison on Set5 for scale factor ×3: most methods reconstruct images with severe blurring artifacts and fail to restore the headpiece clearly. In contrast, MAAN removes artifacts and recovers a higher-quality image. The qualitative comparison on Urban100 for scale factor ×4 is depicted in Figure 9. Although CARN, LESRCNN, and ACNet can produce slightly sharper lines, their lines suffer from significant distortions. In comparison, MAAN combines multi-scale features to expand the receptive field and capture richer multi-frequency information. It thereby reflects the details of the HR image more accurately and reconstructs satisfying results.
Figure 9. Qualitative comparison over Urban100 for scale factor ×4.

Conclusions
In this paper, we present a lightweight MAAN for solving image SR tasks. MAAN first extracts low-resolution features with the CFB. Then, the FFB uses multiple paths to complement the information exchange. Meanwhile, MARB extends the receptive field by extracting feature information at different scales. To further extract high-frequency detail features, an attention mechanism is introduced: AMAB in MARB assigns higher weights to more important features to better learn from all the previous layers. Finally, the reconstruction module employs a combination of low- and high-frequency features to capture SR features more robustly. Experiments show that our final model, MAAN, achieves performance comparable to state-of-the-art lightweight models.
In the future, we will apply AMAB to improve the performance of water surface video super-resolution, which requires more efficiency and lighter weight. As a small network, MAAN is also suitable for application to other image tasks.